Characterizing and modeling the dynamics of online popularity
Online popularity has enormous impact on opinions, culture, policy, and profits. We provide
a quantitative, large scale, temporal analysis of the dynamics of online content popularity in two 
massive model systems, the Wikipedia and an entire country's Web space. We find that the dynamics 
of popularity are characterized by bursts, displaying characteristic features of critical systems such as 
fat-tailed distributions of magnitude and inter-event time. We propose a minimal model combining 
the classic preferential popularity increase mechanism with the occurrence of random popularity 
shifts due to exogenous factors. The model recovers the critical features observed in the empirical 
analysis of the systems analyzed here, highlighting the key factors needed in the description of 
popularity dynamics. 

PACS numbers: 89.75.Hc, 89.20.-a 



The dynamics of information and opinions have been 
deeply affected by the existence of Web-mediated bro- 
kers such as blogs, wikis, folksonomies, and search en- 
gines, through which anyone can easily publish and pro- 
mote content online. This "second age of information" 
is driven by the economy of attention, first theorized by 
Simon [1]. Sources receiving a lot of attention become
popular and have formidable power to impact opinions, 
culture, and policy, as well as advertising profit. The 
Web 2.0 and social media [2] not only modify traditional
communication processes with new types of phenomena, 
but also generate a huge amount of time-stamped data, 
making it possible for the first time to study the dynam- 
ics of online popularity at the global system scale. 

In this letter we focus on the dynamics of popularity 
of Wikipedia topics and Web pages. As popularity prox- 
ies we have chosen the traffic of a document, expressed 
by the number of clicks to that page generated by a spe- 
cific population of users, and the number of hyperlinks 
pointing to a document. It is well documented that the 
statistical properties of these variables in the Web are 
very heterogeneous, with distributions characterized by 
fat tails roughly following power-law behavior [3]-[6]. Such
distributions have been explained with models based on 
the rich-get-richer mechanism [7]-[9], but their validation 
from the point of view of the dynamical behavior is
problematic, mainly due to the difficulty of gathering relevant
data. The data sets utilized here, however, contain tem- 
poral information that makes it possible to observe the 
growth in popularity of individual topics or pages, and 
allows us to statistically characterize the microdynamics 
by which online documents gather popularity. 

Prior work on popularity dynamics has focused on 
news [10], [11], videos [12], [13], and music [14]. Here, we
analyze three large scale data sets that we assembled 
about two information networks: the entire Wikipedia 
and the Chilean Web. Wikipedia is a large collaborative
online encyclopedia with millions of articles and hundreds
of thousands of registered contributors (en.wikipedia.org).
By mining the full edit history of every article, we
were able to reconstruct the entire Wikipedia structure
at any past point in time. The raw data was available
until March 2007 (download.wikimedia.org). Traffic data
with hourly temporal resolution was obtained by
cross-referencing with a separate data set originating from
Wikipedia proxy server logs (dammit.lt/wikistats).
Our third data source is a yearly sequence of crawls of the
Chilean Web, made available by courtesy of the TodoCL
search engine (www.todocl.com). This data consists of
one complete crawl of the .cl top-level domain for each
of the years 2002-2006. Basic statistics on each data set
are shown in Table I. The representative graphs of these
data sets have an approximately power-law distribution
of indegree [15]-[17], like the Web graph at large.

Table I: Descriptions of the data sets constructed for our
study. The two Wiki collections refer to indegree (1) and
traffic (2) of Wikipedia topics, while the Chile collection
refers to indegree of Chilean Web pages.

Data set   Vertices    Period                Temporal resolution
Wiki^1     3,293,102   Jan 2001 - Mar 2007   1 sec.
Wiki^2     3,490,740   Feb 2008 - Current    1 hour
Chile      3,252,779   2001 - 2006           1 year

In order to gauge quantitatively the popularity of
documents we consider the number of hyperlinks pointing
to a page (indegree $k$ in the graph representation of the
Web [3]), and the traffic $s$ of the page, expressed by the
number of clicks to it. Given either of these two
popularity proxies $x_t$ at time $t$, we study its logarithmic
derivative $(\Delta x/x)_t = (x_t - x_{t-1})/x_{t-1}$, which
represents the relative variation of the measure in the time unit.
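
As an illustration of the definition above, the relative variation of a popularity time series can be computed as in the following minimal sketch. The function name `relative_variation` and the sample values are hypothetical, and returning `None` when the previous value is zero is an assumption of this sketch, since the ratio is undefined there.

```python
def relative_variation(series):
    """Return (x_t - x_{t-1}) / x_{t-1} for each consecutive pair.

    Steps whose previous value is zero yield None, because the ratio
    is undefined there (a handling choice assumed for this sketch).
    """
    out = []
    for prev, curr in zip(series, series[1:]):
        out.append((curr - prev) / prev if prev != 0 else None)
    return out


# Made-up indegree values sampled at consecutive time units.
indegree = [10, 12, 12, 30, 33]
print(relative_variation(indegree))
```

In this toy series the third step has a relative variation larger than 1, which corresponds to the popularity more than doubling within one time unit.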

Fig. 1 shows the logarithmic derivative of the indegree
vs time for an example page in the English Wikipedia.
Despite a roughly exponential growth, the logarithmic
derivative provides a signature by which different topics
can be compared on the same scale. Almost all pages
experience a burst in $\Delta x/x$ near the beginning of their
life. Many pages receive little attention thereafter. While
some pages maintain a nearly constant positive logarithmic
derivative, indicating exponential growth, a number of
pages continue to experience intermittent bursts
in $\Delta x/x$ later in their life, as in the example.

Figure 1: Time series of indegree $k$ and its logarithmic
derivative $\Delta k/k$ for the Wikipedia topic page about the
artist Jennifer Hudson. Topics typically experience a burst in
their early life. Here we observe later fluctuations as well.
Jennifer Hudson became popular through a television show leading
to her first burst. Another occurred when she won an Academy
Award; degree popularity doubled as many other pages linked
to the article (inset). The size of each circle shows another
popularity measure; it is proportional to the log-derivative of
the number of times the article is revised. The article receives
more edits when it attracts more links.

The distribution of magnitude $\Delta x/x$ for the two
popularity measures at representative time resolutions is
illustrated in Figs. 2a-c. In all cases and at all
granularities we observe heavy-tailed behavior. Such heavy-tailed
burst magnitude distributions suggest a dynamics lacking
a characteristic scale. This is typical in a wide range of
"critical" physical, economic, and social systems, such as
avalanches, earthquakes, stock market crashes and human
communication [19]-[23]. Further evidence comes
from the study of the distribution of the length of
inter-event intervals. For each document we record the time
stamp of each event for which $\Delta x/x > 1$ and measure the
inter-event times $\Delta t$. The probability distributions of
$\Delta t$ in the different data sets (Fig. 2d) do not follow
a Poissonian form, as would be expected from queueing theory
in traditional systems, but a power law with
a finite size cutoff, as in Omori's law of earthquakes [24]
and other self-organized criticality phenomena [25].
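
The burst and inter-event definitions used above can be sketched as follows. This is only an illustration: the function names and the toy series are hypothetical, time is measured in units of the sampling interval, and steps with a zero previous value are skipped by assumption.

```python
def burst_times(series, threshold=1.0):
    """Indices t at which (x_t - x_{t-1}) / x_{t-1} exceeds the threshold."""
    times = []
    for t in range(1, len(series)):
        prev = series[t - 1]
        # Skip undefined ratios (previous value zero): a sketch assumption.
        if prev > 0 and (series[t] - prev) / prev > threshold:
            times.append(t)
    return times


def inter_event_times(times):
    """Differences between consecutive burst time stamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Made-up popularity values for one document.
series = [1, 3, 4, 9, 10, 11, 30]
bursts = burst_times(series)
print(bursts, inter_event_times(bursts))
```

Pooling the inter-event times of all documents would then give the empirical distribution whose tail is examined in the text.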

The clear evidence for the bursty behavior of online
popularity dynamics calls for a stylized model able to
explain the observed features in terms of the already
acquired popularity of each page and the shifts in collective
attention triggered by exogenous events.

Figure 2: (a-c) Distributions of popularity burst size.
The gray areas highlight the events for which $\Delta k > k$
(hence $\Delta k/k > 1$). Maximum likelihood methods [18] in
conjunction with the Kolmogorov-Smirnov (KS) statistic rule out
lognormal fits. In each case the KS statistic suggests that the
power-law curve is the better fit for the tail. For the
distribution of $\Delta k/k$ in Wikipedia (a) the parameters are
$\alpha = 2.6$ for the exponent of the power law, with a lower
cutoff of 12 and a KS statistic of 0.005. For the Web (b) we find
$\alpha = 1.9$ for the exponent of the power law, with a lower
cutoff of 42 and a KS statistic of 0.007. For the distribution of
$\Delta s/s$ the parameters are $\alpha = 2.1$ with lower cutoff 90
and KS statistic 0.007. The slopes of the best fit power laws are
shown as a guide to the eye. These behaviors are consistent across
a wide range of temporal resolutions, as observed using time units
from a day to a year. (d) Distribution of the time interval
$\Delta t$ between consecutive indegree bursts of Wikipedia
articles. We consider bursts such that $\Delta k/k > 1$ after
January 1st, 2003. The three curves correspond to different time
resolutions of months, weeks, and days, aligned on the x-axis for
ease of visualization. As we increase the resolution the tail of
the distribution extends further, an indication that the cutoff is
a finite size effect. As a guide to the eye we show a power law
$P(\Delta t) \sim (\Delta t)^{-\beta}$ with $\beta \approx 0.8$.
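
To make the kind of tail fit quoted for the burst size distributions more concrete, the sketch below estimates a power-law exponent above a given lower cutoff and computes the KS distance between the data and the fitted curve. It uses the standard continuous maximum likelihood estimator, which is an assumption here: the exact procedure is the one of Ref. [18] and is not reproduced. The function names, the synthetic sample, and the fixed cutoff are all hypothetical.

```python
import math
import random


def fit_power_law_tail(values, x_min):
    """Fit P(x) ~ x^(-alpha) to the values >= x_min.

    Returns (alpha, ks), where alpha is the continuous maximum likelihood
    estimate and ks is the largest gap between the empirical and fitted
    cumulative distributions of the tail.
    """
    tail = sorted(v for v in values if v >= x_min)
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(v / x_min) for v in tail)
    ks = 0.0
    for i, v in enumerate(tail):
        model_cdf = 1.0 - (v / x_min) ** (1.0 - alpha)
        # Compare the model with the empirical step just before and after v.
        ks = max(ks, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return alpha, ks


def sample_power_law(n, alpha, x_min, seed=0):
    """Synthetic power-law sample by inverse transform (for testing only)."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


sample = sample_power_law(20000, alpha=2.6, x_min=12.0)
alpha_hat, ks = fit_power_law_tail(sample, x_min=12.0)
print(round(alpha_hat, 2), round(ks, 3))
```

The sketch takes the lower cutoff as given and does not compare against a lognormal alternative; both steps would be needed to reproduce the full analysis.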

Figure 3: (a) Comparison of the empirical burst size
distributions with what would be expected from a preferential
attachment (PA) process. Extensive numerical tests and
maximum likelihood fitting [18] show that PA generates an
approximately lognormal distribution (defined inside the gray
area) inconsistent with the long tail observed in the empirical
data. (b) The empirical inter-burst time distributions
overlap when time is expressed in terms of the same unit (in the
figure, the common time unit is one day). The distribution
generated by PA is much narrower and fits an exponential
$P(\Delta t) \sim e^{-\Delta t/\tau}$ with $\tau = 0.8$. (c,d) The
rank-shift model, despite its simplicity, reproduces quite well
the distributions of both event size (c) and inter-event time (d).

The rich-get-richer mechanism can be simulated with
the classic linear preferential attachment model [9], in
its directed version [26], or with the ranking model by
Fortunato et al. [27]. In the latter, items are ranked
according to their popularity $x$, and the probability that an
existing item $i$ receives a unit (e.g., a click) is
$P(i) \sim r_i^{-\delta}$, where $r_i$ is the rank of $i$ and
$\delta > 0$ is a free parameter that tunes the power-law
popularity distribution $P(x) \sim x^{-\gamma}$, such that
$\gamma = 1 + 1/\delta$. Both preferential attachment
and ranking models, however, fail to reproduce the long
tails observed in the distributions of both $\Delta x/x$ and
$\Delta t$ (Figs. 3a-b). Neither model accounts for the occurrence
of exogenous factors that shift the attention of users and
suddenly increase the popularity of specific topics
because of events such as an actor winning a prize,
political elections, etc. The minimal assumption in
modeling exogenous perturbation consists in considering
external stochastic events interfering with the basic
rich-get-richer mechanism by suddenly changing the popularity of
a topic. The simplest way to implement this mechanism
consists in introducing in the ranking model a
reranking probability $p$, such that at each iteration every item
is moved, with probability $p$, to a new position toward the
front of the list, chosen randomly with equal probability
between 1 (the top position) and the node's current rank $j$.
We call this the rank-shift model [28].
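
The rank-shift mechanism described in the text can be sketched in a few lines. This is a simplified illustration rather than the simulation code behind the results: it assumes that list position plays the role of rank, that each new item enters at the bottom of the list, and that one unit is assigned per iteration. The function name `rank_shift` and all parameter values are hypothetical.

```python
import random


def rank_shift(n_items, delta=1.0, p=0.001, seed=0):
    """Grow a ranked list of items and return (order, popularity).

    order[r - 1] is the id of the item at rank r; popularity[i] is the
    number of units (links or clicks) collected by item i.
    """
    rng = random.Random(seed)
    order = [0]
    popularity = [0]
    weights = [1.0]  # weights[r - 1] = r ** -delta
    for new_id in range(1, n_items):
        # Rich-get-richer step: one unit goes to rank r with
        # probability proportional to r ** -delta.
        (pos,) = rng.choices(range(len(order)), weights=weights)
        popularity[order[pos]] += 1
        # Exogenous step: each item is promoted with probability p to a
        # position drawn uniformly between the top and its current rank.
        for j in range(len(order)):
            if rng.random() < p:
                order.insert(rng.randint(0, j), order.pop(j))
        # Assumption of this sketch: a new item enters at the bottom.
        order.append(new_id)
        popularity.append(0)
        weights.append(len(order) ** -delta)
    return order, popularity


order, popularity = rank_shift(500, delta=1.0, p=0.001, seed=42)
print("largest popularity values:", sorted(popularity, reverse=True)[:5])
```

Setting `p=0` leaves the list in arrival order, which corresponds to the plain ranking mechanism in this simplified setting. The loop over all items at every iteration keeps the sketch short but makes it quadratic in the number of items, so it is only practical for small systems.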

In Figs. 4a and 4b we show the indegree distribution of
the rank-shift model for several values of $p$: $\delta = 1$ (a)
and $\delta = 1.5$ (b). The ranking model ($p = 0$) yields the
slope $1 + 1/\delta$ indicated by the dashed line. The reranking
probability introduces an exponential cutoff in the
distribution, which becomes relevant for $p \approx 10^{-?}$ and
larger (but we used $10^{-?} < p < 10^{-?}$ in our simulations).

Figure 4: Rank-shift model. (a), (b) Indegree distribution:
$\delta = 1$ (a), $\delta = 1.5$ (b). (c) Comparison of the
distribution of popularity bursts for the ranking model [27]
(circles) and a stylized model built upon the simple assumptions
of growth described in the text. (d) Comparison of the
distribution of popularity bursts with the expected slope derived
by assuming that nodes are reranked at most once.

The distribution of $\Delta k/k$ shows two distinctive
features, which, remarkably, are also found in the empirical
distributions: a maximum located in the range 0.01-0.1 and
a fat tail. Since the reranking probability is low, to
understand the existence and the location of the maximum
it is convenient to consider the model in the absence of
the reranking mechanism. At a large time $T$, the
expected value of the degree of the node with rank $r$ is
proportional to $L r^{-\delta}$, where $L$ is the number of links
present in the network at time $T$. Let $\Delta L$ be the number
of links added during the interval $\Delta T$ at whose extremes
the ratio $\Delta k/k$ is computed. Let $\Delta L \ll L$, an
assumption verified in our calculations. Therefore, one can safely
assume that in the period $\Delta T$ the addition of new links
does not significantly affect the degree of nodes and their
relative ranking. So one can regard the growth process as
a multinomial process with probabilities $p(r) \propto r^{-\delta}$.
The expected number $\Delta k$ of new links acquired by a node of
rank $r$ is therefore $p(r) \Delta L$. The assumption of (almost)
stationarity also provides that $k(r) \sim p(r) L$. We
therefore expect $\Delta k/k$ for a node to be distributed around
$\Delta L/L$, regardless of the node. In Fig. 4c we compare the
simulation of the ranking model with that of the
multinomial process with $p(r) \propto r^{-\delta}$, by using the
parameters relative to the Wikipedia data set of January 2003,
which represents an ideal tradeoff between the needs of having
a sufficient number of bursts and a system size not too
large for the model to run. The number of nodes/pages
was $N \approx 1.3 \cdot 10^{?}$, the number of hyperlinks
$L \approx 1.3 \cdot 10^{?}$ and $\Delta L \approx 8 \cdot 10^{?}$.
Based on the above discussion we expect to observe a maximum in
the distribution of $\Delta k/k$ located at
$\Delta L/L \approx 0.06$. This is exactly where the
maxima of the empirical distributions of popularity bursts
are located (see Fig. 2a).
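
The multinomial argument above can be checked numerically with a small sketch. The sizes used below (1000 nodes, 200000 initial links, 12000 new links) are illustrative values chosen only so that the ratio of new to existing links equals 0.06; they are not the Wikipedia parameters. The function name `multinomial_growth` is hypothetical.

```python
import random


def multinomial_growth(n_nodes, n_links, n_new_links, delta=1.0, seed=0):
    """Assign links to ranks with probability proportional to r ** -delta.

    Returns (k, dk): the degrees built from the first n_links draws and
    the increments from n_new_links further draws, indexed by rank - 1.
    """
    rng = random.Random(seed)
    ranks = range(n_nodes)
    weights = [(r + 1) ** -delta for r in ranks]
    k = [0] * n_nodes
    for r in rng.choices(ranks, weights=weights, k=n_links):
        k[r] += 1
    dk = [0] * n_nodes
    for r in rng.choices(ranks, weights=weights, k=n_new_links):
        dk[r] += 1
    return k, dk


k, dk = multinomial_growth(1000, 200000, 12000)
for rank in (1, 2, 5, 10):
    print(rank, round(dk[rank - 1] / k[rank - 1], 3))
```

The ratios for the best ranked nodes cluster near 0.06 whatever the rank, which is the content of the argument; nodes with few links fluctuate more widely around the same value.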

The ranking model cannot reproduce the fat tail
observed in the real data. This is the reason why we
introduced the reranking mechanism in our model. Here,
it is the nodes that are suddenly promoted to a higher
rank that are responsible for the high values of $\Delta k/k$
in the simulations. We consider a node that at time
$T$ (the reference time at which we start measuring $\Delta k$)
has rank $r_1$, and is immediately promoted to rank $r_2$,
with $r_2$ chosen uniformly in $1 \le r_2 \le r_1$. Under the
same assumption of stationarity that we made above,
the expected degree of the node before promotion is
$k(r_1) \approx L p(r_1) \propto r_1^{-\delta}$. Let us further
assume that $p \ll 1$ and that $\Delta L \ll L$, which hold for
the parameters used in our model. Since the reranking probability
is small, we can safely assume that no node is reranked
more than once during the observation time $\Delta T$. The
expected number of links collected during the period $\Delta T$
is then $\Delta k = \Delta L \, p(r_2) \propto r_2^{-\delta}$. We
therefore expect $\Delta k/k \propto (r_2/r_1)^{-\delta}$. It is
straightforward to derive the distribution $P(\Delta k/k)$ for a
generic node that is promoted at the beginning of $\Delta T$ by
considering all pairs of values $r_1$, $r_2$ uniformly distributed
in $1 \le r_2 \le r_1 \le N$. We find
$P(\Delta k/k) \propto (\Delta k/k)^{-(1 + 1/\delta)}$. In Fig. 4d
we highlight the tail of the distribution $P(\Delta k/k)$ as
produced by the rank-shift model and our expectation for its
slope: the match is surprisingly good.
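
The step from the rank ratio to the tail exponent can be written out explicitly. The following is a sketch in the continuum limit, assuming that the ratio $u = r_2/r_1$ can be treated as uniformly distributed on $(0, 1]$, which holds approximately when $r_2$ is drawn uniformly below $r_1$ and $r_1$ is large. The symbols $u$ and $y$ are introduced here for convenience.

```latex
% Continuum sketch: u = r_2 / r_1 is treated as uniform on (0, 1].
\begin{align}
  y \equiv \frac{\Delta k}{k}
    &\simeq \frac{\Delta L \, p(r_2)}{L \, p(r_1)}
     = \frac{\Delta L}{L} \, u^{-\delta},
     \qquad u = \frac{r_2}{r_1} \in (0, 1], \\
  u(y) &= \left( \frac{y}{\Delta L / L} \right)^{-1/\delta}, \\
  P(y) &= P(u) \left| \frac{du}{dy} \right|
        \propto y^{-1/\delta - 1}
        = y^{-(1 + 1/\delta)} .
\end{align}
```

Since $u \le 1$, the promoted nodes all have $\Delta k/k \ge \Delta L/L$, so this power law describes the part of the distribution to the right of the maximum.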

Simulations of the rank-shift model were performed us- 
ing parameters matching those from the empirical data 
(e.g., N = 2.Sx 10^ nodes for the Wikipedia in 2003); 
the free model parameters were set to fit the empirical 
distributions: $1 < \delta < 1.2$ and $10^{-?} < p < 10^{-?}$. For
$p = 0$ we recover the original ranking model, which yields
a lognormal distribution of $\Delta x/x$, like the preferential
attachment (Fig. 3a). For $p > 0$ numerical simulations
show that the tail of the popularity burst magnitude
distribution shifts from a lognormal to a power law. The
popularity distribution itself remains a power law; its
exponent remains $\gamma = 1 + 1/\delta$, but with an exponential
cutoff depending on $p$.

Such a parsimonious model is able to reproduce the
most relevant features observed in the empirical data.
Not only does rank-shift predict the distributions of both
popularity measures in our data sets, but also the long
tails of the distributions of indegree and traffic burst
size (Fig. 3c). Furthermore, it naturally accounts for
the maxima of the empirical distributions. Remarkably,
the model captures the long-range distribution of
inter-burst intervals as well (Fig. 3d). The random rank-shift
mechanism is therefore able to capture the way in which
Web sites and pages gain and accumulate popularity: not
by a gradual proportional process, but by a sequence of
bursts that move them to the forefront of people's
attention. Such bursts are different from those observed
in news-driven events [10], where attention fades rapidly
and overall popularity is lognormal-distributed. We also
found that smaller rank shifts are unable to capture the
critical burst behavior observed in the data [28].

At the present stage our model is mostly descriptive 
and simply aims at reproducing at the coarsest level the 
distributions that characterize popularity changes. Pos- 
sible refinements may include the effect of search engines, 
external events, news, word of mouth, social media, mar- 
keting campaigns, or any combination of them. The 
study of traffic patterns and models [6], [29], [30] may help
shed empirical light on this question. 